Overview

Within the world of data science and analytics lies a plethora of techniques and strategies at the analyst's fingertips. Below I display a few of the more useful techniques I have used working as a scientist and a student. While each dataset holds unique parameters, and the most useful way to present data varies by audience, basic datasets should showcase these methods well. Because most of my analysis work is performed on confidential data, many of my professional projects cannot be shown publicly. As such, please find below a sampling of the techniques I have used, demonstrated here on public data.


Table of Contents

  1. Data Breakdown

  2. Time Series Analysis

  3. Spatial Analysis

  4. Plotly 3D

  5. Principal Component Analysis


Data Breakdown


One of the most important steps in any analysis is to understand the data being analyzed. This includes understanding the dataset itself, the limitations of the data, how it was collected, any biases represented in it, and the industry knowledge surrounding it. When first surveying any data, a simple analysis can go a long way toward preventing future headaches. If one can maintain sight of the finish line in a maze, it will be easier to find while digging among the weeds.

library(pastecs)
library(ggplot2)
library(gridExtra)
library(ggthemes)
library(RColorBrewer)
library(gifski)
library(gganimate)
library(transformr)
library(spData)
library(leaflet)
library(plotly)
library(ggbiplot)
library(dplyr)
options(scipen = 1000)
options(digits = 3)

AirQuality <- airquality

head(AirQuality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
str(AirQuality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
AirQuality$Month[!duplicated(AirQuality$Month)]
## [1] 5 6 7 8 9
stat.desc(AirQuality[,1:4])
##                     Ozone       Solar.R         Wind          Temp
## nbr.val       116.0000000   146.0000000  153.0000000   153.0000000
## nbr.null        0.0000000     0.0000000    0.0000000     0.0000000
## nbr.na         37.0000000     7.0000000    0.0000000     0.0000000
## min             1.0000000     7.0000000    1.7000000    56.0000000
## max           168.0000000   334.0000000   20.7000000    97.0000000
## range         167.0000000   327.0000000   19.0000000    41.0000000
## sum          4887.0000000 27146.0000000 1523.5000000 11916.0000000
## median         31.5000000   205.0000000    9.7000000    79.0000000
## mean           42.1293103   185.9315068    9.9575163    77.8823529
## SE.mean         3.0628482     7.4532881    0.2848178     0.7652217
## CI.mean.0.95    6.0669128    14.7311225    0.5627128     1.5118439
## var          1088.2005247  8110.5194143   12.4115385    89.5913313
## std.dev        32.9878845    90.0584222    3.5230014     9.4652697
## coef.var        0.7830151     0.4843634    0.3538032     0.1215329


Here we can see we are dealing with a simple data frame with four variables and a time point for each observation, broken into Month and Day columns. This is good to know because we may want an additional column, such as a “Julian Day”, to index observations in a continuous time series rather than only within a month.
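For instance, since the R documentation notes these measurements were taken in New York from May to September 1973, the Month and Day columns can be converted into a true day-of-year index using base R date functions (a sketch; the `julian_day` name is illustrative):

```r
# Build calendar dates from the Month and Day columns (airquality is from 1973)
dates <- as.Date(paste(1973, airquality$Month, airquality$Day, sep = "-"))

# "%j" formats a date as its day of the year (the "Julian day" within the year)
julian_day <- as.integer(format(dates, "%j"))

head(julian_day)  # the series begins on 1 May 1973, day 121 of the year
```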
Let us look at the breakdown of each element with some basic characteristics.

Blues <- brewer.pal(9,"Blues")

OrRd <- brewer.pal(9,"OrRd")

YlOrRd <- brewer.pal(9,"YlOrRd")

Greys <- brewer.pal(9,"Greys")


grid.arrange(
  ggplot(data=AirQuality, aes(Ozone)) +
    geom_histogram(bins = 15, col = Blues[8], fill = Blues[6], alpha=0.7)+
    xlab("Ozone")+
    ylab("Frequency")+
    theme_hc()+
    ylim(0,35),
  ggplot(data=AirQuality, aes(Solar.R))+
    geom_histogram(bins = 15, col = YlOrRd[4], fill = YlOrRd[2], alpha=0.7)+
    xlab("Solar Radiation") +
    ylab("")+
    theme_hc()+
    ylim(0,35),
  ggplot(data=AirQuality, aes(Wind))+
    geom_histogram(bins = 15, col = Greys[6], fill = Greys[4], alpha=0.7)+
    xlab("Wind Speed")+
    ylab("Frequency")+
    theme_hc()+
    ylim(0,35),
  ggplot(data=AirQuality, aes(Temp))+
    geom_histogram(bins = 15, col = OrRd[8], fill = OrRd[6], alpha=0.7)+
    geom_density()+
    xlab("Temperature")+
    ylab("")+
    theme_hc()+
    ylim(0,35),
  nrow=2,
  ncol=2,
  top = paste("Histograms of Variables in Air Quality Dataset \n ", sep = "\n"))


With the above information, we know a bit more about what our dataset holds. It contains just over 150 records, which is not a lot but certainly enough to perform a simple analysis on, and four recorded variables. The dataset is a time series, so we also have two columns tracking the day and month. Looking closer at each variable, we can see all of the basic statistics we would want to know, and then see these visually represented in a histogram of each variable.
Depending on the direction we would want to take this analysis, this would aid us in determining what we can and cannot do, as well as tell us if we need more data in order to continue.



Time Series Analysis


Now that we have some basic statistics on the spread of the data itself, we can begin viewing the data from various angles. Here, we are fortunate to have a dataset that includes the time of collection, so we will continue to use the AirQuality data frame built from R's airquality dataset. This allows us to inspect the data from the perspective of change over time and determine whether any trends may be present. Exploratory analysis of this kind generates ideas for future work, places to focus our effort, so to speak. While these trends have not yet been statistically tested, and so we cannot draw conclusions from them, a good analysis will point to the areas of interest where future statistical efforts should be focused.
For example, in a business or scientific setting, resources (especially time) are always limited. A good analysis highlights the areas that hold the most promise, where that limited capacity can be used to the fullest extent. If a particular area of interest displays a trend within a small initial dataset, it may warrant a full statistical review with a newly collected dataset. While we cannot form a hypothesis from a trend in a dataset and then use the same dataset to test it, a good exploratory analysis tells us where our efforts will be most fruitful.
In this analysis, we will first create a column for what would be the “Julian Day”, which will allow us to graph the series easily. Second, for each of the four variables we will create a companion column expressing each value as a percentage of the distance between that column's minimum and maximum, allowing us to overlay the data and look for shared trends. All four variables are related in some way, as each atmospheric condition affects the others. However, we must note that altering axes can be dangerous, and such graphics must be taken with a grain of salt. They are merely a tool for inspecting the data: if we think we see a pattern, that does not prove one exists, but only gives us grounds to investigate the possibility. To create these “percentage” columns, we will use the following equation:

\[1-\displaystyle \frac{range - (value - min)}{range}\]

This will produce a number between 0 and 1, with the value displayed as a fraction of the whole. The maximum value in the set will equal 1, while the minimum value in the set will equal 0.
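For example, take value = 6 with min = 2 and max = 10, so range = 8:

\[1-\displaystyle \frac{8 - (6 - 2)}{8} = 1 - \frac{4}{8} = 0.5\]

Note that the expression simplifies algebraically to (value - min)/range, the familiar min-max normalization.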
Before we complete that secondary analysis bringing the elements together, let us look at all of the data on one plot:

stack_AQ <- stack(AirQuality[,1:4]) %>%
  mutate(Day = rep(1:153,4))

ggplot(
  stack_AQ, 
  aes(Day, values, fill = factor(ind))
) +
  geom_point(pch=21, size =3)+
  labs(x = "Experiment Day", y = "", fill = "")+
  theme_hc()+
  scale_color_manual(values = c(Blues[6],YlOrRd[2],Greys[4],OrRd[6]),
                     aesthetics = "fill",
                     labels = c("Ozone","Solar Radiation","Wind","Temperature")
  )



AirQuality$DaySeries <- as.numeric(rownames(AirQuality))

PercEqu <- function(x) {
  
  y <- 1- (((max(x, na.rm=T) - min(x, na.rm=T)) - (x - min(x,na.rm=T)))/(max(x, na.rm=T) - min(x, na.rm=T)))
  return(y)
  
}

Percents <- data.frame(apply(AirQuality[,1:4],2,PercEqu))
colnames(Percents) <- paste(colnames(Percents),"_p",sep="")
AirQuality <- data.frame(cbind(AirQuality,Percents))



grid.arrange(
  ggplot(AirQuality, aes(DaySeries,Ozone_p))+
    geom_bar(stat="identity", col = Blues[8], fill = Blues[6])+
    theme_hc()+
    ylab("Ozone")+
    xlab(""),
  ggplot(AirQuality, aes(DaySeries,Solar.R_p))+
    geom_bar(stat="identity", col = YlOrRd[4], fill = YlOrRd[2])+
    theme_hc()+
    ylab("Solar Radiation")+
    xlab(""),
  ggplot(AirQuality, aes(DaySeries,Wind_p))+
    geom_bar(stat="identity", col = Greys[6], fill = Greys[4])+
    theme_hc()+
    ylab("Wind Speed")+
    xlab("Experiment Day"),
  ggplot(AirQuality, aes(DaySeries,Temp_p))+
    geom_bar(stat="identity", col = OrRd[8], fill = OrRd[6])+
    theme_hc()+
    ylab("Temperature")+
    xlab("Experiment Day"),
nrow=2,
ncol=2,
top = paste("Time Series of Each Variable\nExpressed as Percent of Max of Range\n", sep = "\n"))



Looking at these graphics, we can see that there may be a similarity in the trends of Ozone and Temperature. This would make physical sense, as ozone and temperature are linked atmospheric conditions.
Let us take one final look at this in a small animated graphic below:

stack_AQ_perc <- stack(AirQuality[,c(11,8)]) %>%
                         mutate(Day = rep(1:153,2)) %>%
                         filter(!is.na(values)) %>%
                         mutate(col = ifelse(ind == "Ozone_p", Blues[8], OrRd[8]))
  
ggplot(stack_AQ_perc,
       aes(Day,values, fill = ind)) +
  geom_area(stat="identity", 
            position="identity",
            col = stack_AQ_perc$col,
            size = 0.5,
            alpha = 0.8) +
  scale_fill_manual(values = c(OrRd[6],Blues[6]), labels = c("Temperature", "Ozone")) +
  theme_hc()+
  labs(x="Experiment Day",y="", fill = "", title = "Temperature and Ozone Comparison")+
  theme(plot.title = element_text(hjust = 0.5))+
  transition_reveal(Day)



The animation gives us a good idea of how Ozone and Temperature appear to be related, showing that the rises and falls of each seem to come together.
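The visual impression can be checked with a quick number before any formal testing. A simple Pearson correlation between the raw Ozone and Temp columns (dropping missing values) suggests a fairly strong positive association:

```r
# Pearson correlation between raw Ozone and Temp, using only complete pairs
r <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
round(r, 2)  # roughly 0.7
```

This is still exploratory rather than a hypothesis test, but it puts a number on what the animation shows.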

Spatial Analysis


One of the more interesting analyses possible in R is spatial analysis. Although most complex spatial work is conducted in a dedicated GIS such as ESRI’s ArcGIS Pro, spatial analysis can also be accomplished in R, which makes integration with the rest of a codebase quite simple. Here, we will bring in a spatial dataset and visually analyze it with an interactive Leaflet map (a tool often combined with a Shiny application), and then take it a step further and use Plotly to analyze it in three dimensions.
We will bring in the boston.c dataset from the spData package. This is a housing dataset from 1978.

Boston <- boston.c

RdYlBu <- brewer.pal(11,"RdYlBu")

pal.values.pop <- colorNumeric(
  palette = c(RdYlBu[11], RdYlBu[8], RdYlBu[4],RdYlBu[1]),
  domain = Boston$CMEDV
  
)


leaflet(width = 900) %>%
  addTiles() %>%
  addCircleMarkers(lng = Boston$LON,
                   lat = Boston$LAT,
                   radius = 7,
                   fillColor = pal.values.pop(Boston$CMEDV),
                   color = pal.values.pop(Boston$CMEDV),
                   stroke = T,
                   weight = 2,
                   fillOpacity = 0.6,
                   opacity = 1,
                   popup = paste0("Median Home Value: $",Boston$CMEDV*1000),
                   options = popupOptions(closeButton = T))



Now, we can compare median home value to crime rate per capita.

pal.values.cr <- colorNumeric(
  palette = c(RdYlBu[11],RdYlBu[4],RdYlBu[1]),
  domain = Boston$CRIM
  
)

leaflet(width = 900) %>%
  addTiles() %>%
  addCircleMarkers(lng = Boston$LON,
                   lat = Boston$LAT,
                   radius = 7,
                   fillColor = pal.values.cr(Boston$CRIM),
                   color = pal.values.cr(Boston$CRIM),
                   stroke = T,
                   weight = 2,
                   fillOpacity = 0.6,
                   opacity = 1,
                   popup = paste0("Crime per capita: ",Boston$CRIM),
                   options = popupOptions(closeButton = T))



Without making any judgements, and acknowledging that this is a much more complex issue, we can at least visually demonstrate an apparent correlation between crime rate and median home value in Boston in 1978. Clicking on any point yields the respective value.
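As with the time series, a single coefficient can put a number on what the maps suggest. This sketch assumes only that boston.c has been loaded from spData as above:

```r
library(spData)  # provides the boston.c Boston housing dataset

Boston <- boston.c
# Correlation between corrected median home value (in $1000s) and per-capita crime
cor(Boston$CMEDV, Boston$CRIM)  # negative: tracts with more crime have lower values
```

The coefficient is negative but only moderate, which is consistent with the caveat above.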

Plotly 3D


Plotly provides a full set of techniques for presenting graphics, but the way in which I prefer to use it is to make 3D plots. ggplot2 creates fantastic, easily understood plots and graphs for most purposes, but I find Plotly functions better in the 3D realm.

plot_ly(Boston, 
        x=~LON, 
        y=~LAT, 
        z = ~CMEDV, 
        type="scatter3d", 
        marker= list(size = 3, 
                     color = ~CMEDV,
                     colorscale =c(RdYlBu[11], RdYlBu[8], RdYlBu[4],RdYlBu[1]),
                     showscale=T)
        ) %>%
  layout(scene = list(
    xaxis=list(title="Longitude"),
    yaxis=list(title="Latitude"),
    zaxis=list(title="Median House Price")
  ))



This provides a view similar to the two-dimensional one, but gives a more striking impression of the stark differences in housing prices across such a small area. In fact, in 3D you can see the river cutting the poorest section of town in two, with the wealthier side of the river holding the smaller portion of the economically challenged section.
Now we can compare the crime rates in a similar fashion:

plot_ly(Boston, 
        x=~LON, 
        y=~LAT, 
        z = ~CRIM, 
        type="scatter3d", 
        marker= list(size = 3, 
                     color = ~CRIM,
                     colorscale =c(RdYlBu[11],RdYlBu[4],RdYlBu[1]),
                     showscale=T)
        ) %>%
  layout(scene = list(
    xaxis=list(title="Longitude"),
    yaxis=list(title="Latitude"),
    zaxis=list(title="Crime Rate")
  ))



We can see here that the spot with the highest crime rate lines up again with the poorest area of the city.

Principal Component Analysis


Finally, we can complete a small Principal Component Analysis (PCA). This is commonly used on “wide” datasets, those with many variables. Through a mathematical process, we can project all of these variables onto “principal components”, allowing us to find which variables are most strongly correlated across the dataset and which account for the largest share of the variation within it.
This can be quite useful in real-world application, as we often encounter large datasets with many variables, and picking through them all for correlation and variation can be tedious. This method, especially using R, makes it quite simple to at least get a handle on the situation.
Oftentimes the mtcars dataset built into R is used to demonstrate PCA, so we will instead use the iris dataset, which holds fewer variables but will still display the usefulness of the code. This also lets me write code that is not as readily copied from the web (as I prefer to write my own code, learning from documentation and the like).

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris.pca <- prcomp(iris[,1:4],scale.=T,center = T)

summary(iris.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000



How do we interpret this initial summary? Here we can see that the first two principal components (PC1 and PC2) together capture about 96% of the variance. That is a good thing, and will allow us to complete a robust analysis. Let us view this visually.
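For reference, those proportions come directly from the component standard deviations stored in the prcomp object; dividing each component's variance by the total variance reproduces the summary rows (the `prop_var` name is illustrative):

```r
# PCA on the four numeric iris columns, centered and scaled as above
iris.pca <- prcomp(iris[, 1:4], scale. = TRUE, center = TRUE)

# Each component's variance as a share of the total variance
prop_var <- iris.pca$sdev^2 / sum(iris.pca$sdev^2)
round(prop_var, 4)          # the "Proportion of Variance" row
round(cumsum(prop_var), 4)  # the "Cumulative Proportion" row
```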

ggbiplot(iris.pca,
         groups=iris$Species,
         ellipse = T,
         var.scale = 1,
         obs.scale = 1) +
  ggtitle("PCA of Iris Dataset")+
  xlab("PC1")+
  ylab("PC2")+
  theme(plot.title = element_text(hjust = 0.5))



Here we can see the principal component analysis visually displayed for a simple dataset: we can view how the variables interact and how the different species separate along those components. Without going into much detail, we can see how R lets us complete what is a mathematically complex analysis, involving linear algebra and matrices, in just a couple lines of code.

Contact Information

Thank you for taking the time to peruse my analysis. While this is a science, it can also fall into the category of an art form. If you would like to contact me for any reason, please feel free to reach me at my email:

Email:

Even if you would like to touch base on something unrelated to employment, feel free to reach out. While this is my portfolio, I am always interested to connect with others in my field, and am always open to constructive criticism.





GitHub: https://github.com/jpcullum

LinkedIn: https://www.linkedin.com/in/josh-cullum-74891722b/

Portfolio Website: https://jpcullum.github.io/

End Document